
28 research outputs found

Graphe de Chevauchements pour les Algorithmes d'Assemblage et de Scaffolding : Etat de l'Art des Paradigmes et Propositions d'Implementations

Author: Epain Victor
Publication venue: HAL CCSD
Publication date: 01/01/2022

Assembling deoxyribonucleic acid (DNA) fragments based on their overlaps remains the main assembly paradigm for long-fragment sequencing technologies, whether the aim is to resolve one haplotype or several. Since an overlap can be seen as a succession relationship between two oriented fragments, the directed graph has emerged as the most appropriate data structure for handling overlaps. However, this graph paradigm does not appear to exploit the reverse symmetry of the oriented fragments and their overlaps, which results from blind DNA double-strand sequencing. The bi-directed graph paradigm was therefore introduced to reduce the graph size by handling the reverse symmetry, and it has since become the most widely used graph paradigm. Nevertheless, the graph paradigms have never been contrasted, and no implementations have been described. Here we present a complete review of the existing overlap graph paradigms. Furthermore, we present different implementations that are compared theoretically in terms of memory and of their impact on the design and running time of some basic graph algorithms. We also show that, by adopting closely related implementation logic, one graph paradigm can be switched to another.

INRIA a CCSD electronic archive server

Integer Programming Approach for Nested Pairs Genome Scaffolding

Authors: Andonov Rumen, Epain Victor
Publication venue: HAL CCSD
Publication date: 18/03/2022

The scaffolding step in genome assembly aims to determine the order and orientation of a large number of previously assembled genomic fragments (contigs/scaffolds). Here we introduce a particular case of this problem, which we call Nested Pairs Scaffolding. We formulate it as an optimisation problem and propose an integer programming formulation to solve it. Computational results on real and synthetic data show excellent behaviour of our formulation.

INRIA a CCSD electronic archive server

Graphes de Chevauchements à destination d'Algorithmes d'Assemblage et de Scaffolding : Retours sur les Paradigmes et Propositions d'Implémentations

Authors: Andonov Rumen, Epain Victor
Publication venue: HAL CCSD
Publication date: 14/10/2022

Assembling DNA fragments based on their overlaps remains the main assembly paradigm for long-fragment sequencing technologies, whether the aim is to resolve one haplotype or several. Since an overlap can be seen as a succession relationship between two oriented fragments, the directed graph has emerged as an appropriate data structure for handling overlaps. However, this graph paradigm does not appear to exploit the reverse symmetry of the oriented fragments and their overlaps, which results from blind DNA double-strand sequencing. The bi-directed graph paradigm was therefore introduced in 1995 to reduce the graph size by handling the reverse symmetry, and it has since become the main graph paradigm used in assembly and scaffolding methods. Nevertheless, the available graph paradigms have never been contrasted, and no implementations have been described. Here we present a complete review of the existing overlap graph paradigms. Furthermore, we present suitable data structures that are compared theoretically in terms of time and memory consumption in the context of the design of some basic graph algorithms. We also show that each paradigm can be switched to another by slightly modifying its data structures.
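To make the reverse symmetry mentioned in this abstract concrete, here is a minimal Python sketch. It is not the data structure proposed by the authors; the names `reverse`, `reverse_overlap`, `directed_edges` and `canonical_edges` are hypothetical, and the encoding of an oriented fragment as an `(id, orientation)` pair is an assumption made for illustration. It only shows that every overlap u -> v implies the overlap reverse(v) -> reverse(u), so a structure that stores one representative per symmetric pair can hold about half as many edges.

```python
from typing import Set, Tuple

# Assumed encoding: an oriented fragment is (fragment_id, orientation),
# with orientation '+' (forward) or '-' (reverse-complement).
Oriented = Tuple[str, str]
Overlap = Tuple[Oriented, Oriented]


def reverse(frag: Oriented) -> Oriented:
    """Return the same fragment in the opposite orientation."""
    ident, orient = frag
    return (ident, "-" if orient == "+" else "+")


def reverse_overlap(ovl: Overlap) -> Overlap:
    """If u overlaps v, then reverse(v) overlaps reverse(u)."""
    u, v = ovl
    return (reverse(v), reverse(u))


def directed_edges(overlaps: Set[Overlap]) -> Set[Overlap]:
    """Directed view: store each overlap together with its reverse."""
    edges = set()
    for ovl in overlaps:
        edges.add(ovl)
        edges.add(reverse_overlap(ovl))
    return edges


def canonical_edges(overlaps: Set[Overlap]) -> Set[Overlap]:
    """Symmetry-aware view: keep one representative per symmetric pair."""
    return {min(ovl, reverse_overlap(ovl)) for ovl in overlaps}


overlaps = {(("a", "+"), ("b", "+")), (("b", "+"), ("c", "-"))}
print(len(directed_edges(overlaps)), len(canonical_edges(overlaps)))
```

Choosing the lexicographically smaller of the two symmetric overlaps is just one convenient way to pick a representative; any deterministic rule would do.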

INRIA a CCSD electronic archive server

Scaffolding Optimal pour les Régions Répétées Inverses-Complémentaires de Génome de Chloroplastes

Authors: Andonov Rumen, Epain Victor, Lavenier Dominique
Publication venue: HAL CCSD
Publication date: 05/07/2022

The scaffolding step in genome assembly aims to determine the order and orientation of a large number of previously assembled genomic fragments (contigs/scaffolds). Here we introduce a particular case of this problem, which we call Nested Inverted Fragments Scaffolding (NIFS). We formulate it as an optimisation problem in a particular kind of directed graph that we call the Multiplied Doubled Contigs Graph (MDCG). Furthermore, we prove that the NIFS problem is NP-hard. We also discuss how the chloroplast data were generated by filtering reads sequenced from both plants and chloroplasts. Moreover, we propose a graph structure to visualise the solution and to highlight the particular structure of chloroplast regions.

INRIA a CCSD electronic archive server

Optimal de novo assemblies for chloroplast genomes based on inverted repeats patterns

Authors: Andonov Rumen, Epain Victor, Lavenier Dominique
Publication venue: HAL CCSD
Publication date: 12/07/2021

Background: Chloroplast genome assembly remains challenging because the sequencing step outputs short reads from both the plant and plastid genomes. Some recent dedicated assemblers [1,2] use the information of a highly conserved circular and quadripartite structure with a pair of dispersed inverted repeat regions in chloroplast genomes.

Materials and methods: We designed a dedicated pattern-driven de novo assembler which requires only short unpaired reads (distances provided by paired reads are not needed), sequenced from both the plant and its chloroplasts. A first step consists in separating the chloroplast reads from the reads specific to the plant. To this end we use the observation that chloroplast genomes are over-represented compared to the plant genome. We then compute an estimated coverage of the pre-assembled contigs and keep the ones with higher coverage. The first step outputs an assembly graph where each vertex corresponds to a contig and is provided with an estimated multiplicity number. Next, we use another graph where each vertex is duplicated according to its multiplicity number and to the two possible contig orientations. The edges are duplicated accordingly. In our approach the genome assembly is modelled as finding an elementary path in this graph. We formulate the dispersed repeats as linear constraints and we search for an elementary path using Integer Linear Programming, similarly to [3]. In our approach inverted repeats correspond to occurrences of contigs paired with other occurrences of the same contigs in reverse orientation. Their positions on the assembled sequence must satisfy a nested-pairs pattern. We formulate the above constraints as a linear program whose objective is to maximize the number of nested pairs. Thus, we generalize a similar approach applied to RNA folding [4]. Indeed, in contrast to the latter approach, where the vertices correspond to bases with known sequence indices, in our case the positions of the contigs are variables. Our tool is implemented in Python 3 and uses the open-source PuLP package, which integrates a free solver, to solve the above optimization problem.

Results: We tested our program with QUAST [5] and obtained very encouraging preliminary results, with high genome coverage (mostly >99%) and very low mismatch and indel rates.

Conclusions: We designed a dedicated pattern-driven de novo assembler for chloroplast genomes that uses only short unpaired reads. We formulated the conserved circular and quadripartite structure as linear constraints and implemented this model in an open-source program. Finally, the QUAST evaluation returned encouraging preliminary results.
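The nested-pairs pattern described in this abstract can be illustrated with a small check in Python. The abstract does not spell out a formal definition, so this sketch assumes that "nested" means non-crossing, as in RNA secondary structures: two pairs (i, j) and (k, l) must never satisfy i < k < j < l. The function names and the `(contig_id, orientation)` encoding are hypothetical, and the sketch only verifies a given ordering; it does not reproduce the authors' integer linear program, in which the contig positions are variables.

```python
from typing import List, Tuple

# Assumed encoding: one occurrence of a contig on the assembled sequence.
Occurrence = Tuple[str, str]  # (contig_id, orientation '+' or '-')
Pair = Tuple[int, int]        # positions (i, j) on the path, with i < j


def is_inverted_pair(a: Occurrence, b: Occurrence) -> bool:
    """Same contig, opposite orientations."""
    return a[0] == b[0] and a[1] != b[1]


def pairs_are_nested(pairs: List[Pair]) -> bool:
    """True if no two pairs cross, i.e. never i < k < j < l."""
    for i, j in pairs:
        for k, l in pairs:
            if i < k < j < l:
                return False
    return True


def valid_nested_pairs(path: List[Occurrence], pairs: List[Pair]) -> bool:
    """Check that every pair is an inverted repeat and that all pairs nest."""
    inverted = all(i < j and is_inverted_pair(path[i], path[j]) for i, j in pairs)
    return inverted and pairs_are_nested(pairs)


# Hypothetical ordering: x and y each occur once forward and once reversed.
path = [("x", "+"), ("y", "+"), ("s", "+"), ("y", "-"), ("x", "-")]
print(valid_nested_pairs(path, [(0, 4), (1, 3)]))
```

In the optimisation setting of the abstract, a check like this would only validate a candidate solution; the objective there is to maximize the number of such pairs.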

INRIA a CCSD electronic archive server

Graphe de Chevauchements pour les Algorithmes d'Assemblage et de Scaffolding : Etat de l'Art des Paradigmes et Propositions d'Implementations

Author: Epain Victor
Publication venue: HAL CCSD
Publication date: 01/01/2022

HAL-CentraleSupelec

INRIA a CCSD electronic archive server

HAL-Rennes 1

Assemblage de novo de longues lectures par la programmation mathématique linéaire

Author: Epain Victor
Publication venue: HAL CCSD
Publication date: 05/09/2019

Hristo Djidjev: collaborator from the HipcoGen associated team.

Studying a genome in silico requires two steps: sequencing it, by cloning and cutting the genome into several reads, and then assembling the reads. It is well known that the number of sequencing errors is proportional to read length. However, long reads can be an advantage when dealing with repeated genomic regions. De novo assembly is an assembly method that does not use a reference. The tool described here, named LOREAS, is intended to be a de novo assembler that works in two tasks: first, ordering the long reads, and then obtaining a consensus sequence of the ordered reads. Currently, only the first task has been implemented. While other de novo long-read assemblers use heuristics and De Bruijn graphs, LOREAS is based on overlap similarity between all the long reads. It uses integer linear programming to find the heaviest path in a graph G = (V, E, λ), where V is the set of vertices corresponding to the long reads and E is the set of edges associated with the overlaps between long reads, weighted by λ, the overlap length. When this graph is too large, the set of reads V is partitioned into several parts, and all the parts are then solved sequentially. Here we present the solution to the first task for ten bacterial genomes. Seven of them were solved successfully in less than 12 minutes on a laptop.

Studying a genome in silico requires two main tasks: sequencing it, by cloning it and then cutting it into several reads, and then assembling the reads. Sequencing errors depend on the length of the generated reads: the error rate for long reads is higher than for short reads. However, long reads help to counter the problems caused by repeated genomic regions. De novo assembly is a method that needs no reference. The program presented, LOREAS, is intended to be a two-step de novo assembler: the first step orders the long reads, and the second builds a consensus sequence from the ordered reads. So far, only the first step has been implemented. Whereas other de novo assemblers use heuristics and De Bruijn graphs, LOREAS is based on overlap similarities between all the reads. To this end, integer linear programming is used to find the heaviest path in a graph G = (V, E, λ), where V is the set of vertices (the long reads) and E is the set of arcs representing the overlaps between long reads, weighted by λ, the overlap length. If this graph is too large, the set V is partitioned into distinct parts, and all the parts are then solved sequentially. Ten simulated sequenced bacterial genomes were processed for the long-read ordering task; seven of the ten gave positive results, obtained in less than 12 minutes on a laptop.
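As an illustration of the heaviest-path objective on G = (V, E, λ), the following Python sketch enumerates elementary paths by exhaustive depth-first search on a toy graph. This is deliberately not the integer linear programming approach used by LOREAS: brute force is swapped in only to keep the example self-contained, and it is practical only for very small graphs. The graph encoding, the function name `heaviest_path` and the read identifiers are assumptions made for this example.

```python
from typing import Dict, List, Set, Tuple

# Assumed encoding: graph[u][v] = lambda(u, v), the overlap length of arc u -> v.
# Every vertex should appear as a key, even if it has no outgoing arc.
Graph = Dict[str, Dict[str, int]]


def heaviest_path(graph: Graph) -> Tuple[int, List[str]]:
    """Return (total weight, vertex list) of a heaviest elementary path."""
    best_weight = 0
    best_path: List[str] = []

    def extend(path: List[str], weight: int, visited: Set[str]) -> None:
        nonlocal best_weight, best_path
        # Record the current path if it is the first seen or strictly heavier.
        if not best_path or weight > best_weight:
            best_weight, best_path = weight, list(path)
        for nxt, lam in graph.get(path[-1], {}).items():
            if nxt not in visited:  # elementary path: no vertex used twice
                visited.add(nxt)
                path.append(nxt)
                extend(path, weight + lam, visited)
                path.pop()
                visited.remove(nxt)

    for start in graph:
        extend([start], 0, {start})
    return best_weight, best_path


# Toy instance with three hypothetical reads and their overlap lengths.
toy = {"r1": {"r2": 500, "r3": 200}, "r2": {"r3": 400}, "r3": {}}
print(heaviest_path(toy))
```

Exhaustive search grows exponentially with the number of reads, which is one reason an exact method on real data would rely on a solver and, as the abstract notes, on partitioning V when the graph is too large.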

HAL-CentraleSupelec

INRIA a CCSD electronic archive server

HAL-Rennes 1

Assemblage de fragments ADN : structures de graphes et échafaudage de génomes de chloroplastes: Analyses comparatives, formulations et implémentations

Author: Epain Victor
Publication venue: HAL CCSD
Publication date: 27/12/2023

To obtain the nucleotide sequence of a DNA molecule, the molecule is fragmented by sequencing technologies and the fragments are then assembled. These fragments are called reads. They are subject to sequencing errors and must be considered in two orientations: that of their original DNA strand, or the reverse-complement for the other strand. Assembly is based on pairwise overlaps between oriented reads and consists of three phases: assembling the reads to obtain contigs (sequences longer than the reads), scaffolding the contigs to obtain scaffolds (orders of oriented contigs), and completing the scaffolds (finding the nucleotide sequences that separate the oriented contigs in the scaffolds). In this thesis, we compare graph structures representing succession relations between oriented DNA sequences, which are useful at different phases of assembly. We then address the scaffolding problem dedicated to chloroplast genomes by proposing a new formulation, an exact resolution and an implementation.

HAL-CentraleSupelec